library(tidyverse)
library(GGally)
library(reshape)

Sobre forma e medida

Em análise exploratória, com frequência queremos examinar se há associação entre duas variáveis numéricas. Para descrever uma associação como essa, há pelo menos quatro características importantes:

  1. Formato da associação: linear, exponencial, parabólica, linear e depois assintótica, outro formato arbitrário, etc.
  2. Força da associação: correlação forte, fraca, nenhuma.
  3. Sinal da associação: correlação positiva ou negativa, quando é perceptível.
  4. Pontos extremos fora da associação.

Sumários x todos os pontos

Conselho importante: não há um número apenas que possa lhe responder tudo sobre a associação entre as duas variáveis. As medidas de correlação que você usará servirão mais para complementar e quantificar observações feitas em gráficos de dispersão do que para lhe guiar.

Um exemplo classico com quatro pares de variáveis cujas associações interessam: x1 e y1, x2 e y2, etc., o quarteto de Anscombe:

summary(anscombe)
       x1             x2             x3             x4           y1               y2              y3       
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260   Min.   :3.100   Min.   : 5.39  
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25  
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580   Median :8.140   Median : 7.11  
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501   Mean   :7.501   Mean   : 7.50  
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98  
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840   Max.   :9.260   Max.   :12.74  
       y4        
 Min.   : 5.250  
 1st Qu.: 6.170  
 Median : 7.040  
 Mean   : 7.501  
 3rd Qu.: 8.190  
 Max.   :12.500  
# um pouco de rearrumação primeiro. 
# (para entender melhor, você pode procurar sobre dados em formato
# wide ou long no R)
t1 <- melt(select(anscombe, 1:4), id = c())
t2 <- melt(select(anscombe, 5:8), id = c())
names(t2) <- c("v2", "value2")
dados <- data.frame(vars = paste(t1$variable, "e", t2$v2), 
               v1 = t1$value, 
               v2 = t2$value2)

Calculando a correlação linear:

E se olharmos os dados?

ggplot(dados, aes(v1, v2)) + 
  geom_point(color = "darkorange", size = 4, alpha = 0.7) + 
  theme_bw() + 
  scale_x_continuous(breaks = seq(0, 20, 2)) + 
  scale_y_continuous(breaks = seq(0, 12, 2)) + 
  expand_limits(x = 0, y = 0) + 
  facet_wrap(~ vars)

Quatro relações diferentes, mesma quantificação. Para o segundo grupo, não há uma relação linear. No 3o, há uma relação perfeita entre a maioria das observações, com uma exceção. No 4o grupo não há relação; há uma exceção que faz parecer que há uma relação.

O que os outros coeficientes podem nos dizer?

dados %>% 
  group_by(vars) %>% 
  summarise(pearson = cor(v1, v2, method = "pearson"), 
            spearman = cor(v1, v2, method = "spearman"),
            kendall = cor(v1, v2, method = "kendall"))

Força e também direção

Dito isso, essa figura ajuda a lembrar a relação entre o valor esperado do coeficiente (linear) e vários tipos de associação entre duas variáveis:

da wikipedia

da wikipedia


Alguns testes gerando a relação

Relação linear:

set.seed(123)
x <- rnorm(100) * 100
tamanho_do_erro <- 50
y <- 0.5 * x + rnorm(100) * tamanho_do_erro + 20
df <- data.frame(x = x, 
                 y = y)
ggplot(df, aes(x = x, y = y)) + 
  geom_point(colour = "darkgrey", size = 4) + 
  theme_bw()

cor(df$x, df$y, method = "pearson")
[1] 0.66713
# Dobro de erro em torno de uma função linear f(x)
df$y <- 0.5 * x + rnorm(100) * tamanho_do_erro * 2 + 20
ggplot(df, aes(x = x, y = y)) + 
  geom_point(colour = "darkgrey", size = 4) + 
  theme_bw()

cor(df$x, df$y, method = "pearson")
[1] 0.3339445

Relação não linear

n = 100
df <- data.frame(x = runif(n, min= 1, max = 20))
df$y = 100 * exp(-1.2 * df$x) #+ rnorm(n, mean = 0.05, sd = 1)
ggplot(df, aes(x = x, y = y)) + 
  geom_point(colour = "darkgrey", size = 4) 

ggplot(df, aes(x = x, y = y)) + 
  geom_point(colour = "darkgrey", size = 4) + 
  scale_y_log10()  

# tente descomentando o scale_y_log10 acima e veja o que acontece
  
cor(df$x, df$y)
[1] -0.540947
cor(df$x, df$y, method = "spearman")
[1] -1
cor(df$x, df$y, method = "kendall")
[1] -1

Dica prática

Quando considerando várias variáveis, o mais prático é usar uma matriz de gráficos de dispersão:

ggpairs(dados)

 plot: [1,1] [===-----------------------------------------------------------------------------]  4% est: 0s 
 plot: [1,2] [======--------------------------------------------------------------------------]  8% est: 3s 
 plot: [1,3] [==========----------------------------------------------------------------------] 12% est: 8s 
 plot: [1,4] [=============-------------------------------------------------------------------] 16% est: 8s 
 plot: [1,5] [================----------------------------------------------------------------] 20% est: 7s 
 plot: [2,1] [===================-------------------------------------------------------------] 24% est: 6s 
 plot: [2,2] [======================----------------------------------------------------------] 28% est: 6s 
 plot: [2,3] [==========================------------------------------------------------------] 32% est: 5s 
 plot: [2,4] [=============================---------------------------------------------------] 36% est: 4s 
 plot: [2,5] [================================------------------------------------------------] 40% est: 4s 
 plot: [3,1] [===================================---------------------------------------------] 44% est: 4s 
 plot: [3,2] [======================================------------------------------------------] 48% est: 4s 
 plot: [3,3] [==========================================--------------------------------------] 52% est: 4s 
 plot: [3,4] [=============================================-----------------------------------] 56% est: 3s 
 plot: [3,5] [================================================--------------------------------] 60% est: 3s 
 plot: [4,1] [===================================================-----------------------------] 64% est: 2s 
 plot: [4,2] [======================================================--------------------------] 68% est: 2s 
 plot: [4,3] [==========================================================----------------------] 72% est: 2s 
 plot: [4,4] [=============================================================-------------------] 76% est: 2s 
 plot: [4,5] [================================================================----------------] 80% est: 1s 
 plot: [5,1] [===================================================================-------------] 84% est: 1s 
 plot: [5,2] [======================================================================----------] 88% est: 1s 
 plot: [5,3] [==========================================================================------] 92% est: 1s 
 plot: [5,4] [=============================================================================---] 96% est: 0s 
 plot: [5,5] [================================================================================]100% est: 0s 
                                                                                                            


names(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"   "x"       "y"       "z"      
ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_hex()

ggplot(diamonds, aes(x = price)) + 
  geom_histogram() + 
  scale_x_log10()

ggpairs(dados)

 plot: [1,1] [===-----------------------------------------------------------------------------]  4% est: 0s 
 plot: [1,2] [======--------------------------------------------------------------------------]  8% est: 1s 
 plot: [1,3] [==========----------------------------------------------------------------------] 12% est: 3s 
 plot: [1,4] [=============-------------------------------------------------------------------] 16% est: 4s 
 plot: [1,5] [================----------------------------------------------------------------] 20% est: 4s 
 plot: [2,1] [===================-------------------------------------------------------------] 24% est: 4s 
 plot: [2,2] [======================----------------------------------------------------------] 28% est: 4s 
 plot: [2,3] [==========================------------------------------------------------------] 32% est: 4s 
 plot: [2,4] [=============================---------------------------------------------------] 36% est: 4s 
 plot: [2,5] [================================------------------------------------------------] 40% est: 3s 
 plot: [3,1] [===================================---------------------------------------------] 44% est: 3s 
 plot: [3,2] [======================================------------------------------------------] 48% est: 3s 
 plot: [3,3] [==========================================--------------------------------------] 52% est: 3s 
 plot: [3,4] [=============================================-----------------------------------] 56% est: 3s 
 plot: [3,5] [================================================--------------------------------] 60% est: 3s 
 plot: [4,1] [===================================================-----------------------------] 64% est: 2s 
 plot: [4,2] [======================================================--------------------------] 68% est: 2s 
 plot: [4,3] [==========================================================----------------------] 72% est: 2s 
 plot: [4,4] [=============================================================-------------------] 76% est: 2s 
 plot: [4,5] [================================================================----------------] 80% est: 1s 
 plot: [5,1] [===================================================================-------------] 84% est: 1s 
 plot: [5,2] [======================================================================----------] 88% est: 1s 
 plot: [5,3] [==========================================================================------] 92% est: 0s 
 plot: [5,4] [=============================================================================---] 96% est: 0s 
 plot: [5,5] [================================================================================]100% est: 0s 
                                                                                                            

LS0tCnRpdGxlOiAiQWxndW1hcyBkaWNhcyBzb2JyZSBmb3JtYXMgZSBtZWRpZGFzIGRlIGNvcnJlbGHDp8OjbyIKYXV0aG9yOiAiTmF6YXJlbm8gQW5kcmFkZSIKZGF0ZTogIjEzIGRlIG1hcsOnbyBkZSAyMDE2IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkoR0dhbGx5KQpsaWJyYXJ5KHJlc2hhcGUpCmBgYAoKIyMgU29icmUgZm9ybWEgZSBtZWRpZGEKCkVtIGFuw6FsaXNlIGV4cGxvcmF0w7NyaWEsIGNvbSBmcmVxdcOqbmNpYSBxdWVyZW1vcyBleGFtaW5hciBzZSBow6EgYXNzb2NpYcOnw6NvIGVudHJlIGR1YXMgdmFyacOhdmVpcyBudW3DqXJpY2FzLiBQYXJhIGRlc2NyZXZlciB1bWEgYXNzb2NpYcOnw6NvIGNvbW8gZXNzYSwgaMOhIHBlbG8gbWVub3MgcXVhdHJvIGNhcmFjdGVyw61zdGljYXMgaW1wb3J0YW50ZXM6IAoKICAxLiAqRm9ybWF0byogZGEgYXNzb2NpYcOnw6NvOiBsaW5lYXIsIGV4cG9uZW5jaWFsLCBwYXJhYsOzbGljYSwgbGluZWFyIGUgZGVwb2lzIGFzc2ludMOzdGljYSwgb3V0cm8gZm9ybWF0byBhcmJpdHLDoXJpbywgZXRjLgogIDIuICpGb3LDp2EqIGRhIGFzc29jaWHDp8OjbzogY29ycmVsYcOnw6NvIGZvcnRlLCBmcmFjYSwgbmVuaHVtYS4gCiAgMy4gKlNpbmFsKiBkYSBhc3NvY2lhw6fDo286IGNvcnJlbGHDp8OjbyBwb3NpdGl2YSBvdSBuZWdhdGl2YSwgcXVhbmRvIMOpIHBlcmNlcHTDrXZlbC4gCiAgNC4gKlBvbnRvcyBleHRyZW1vcyogZm9yYSBkYSBhc3NvY2lhw6fDo28uCgojIyMgU3Vtw6FyaW9zIHggdG9kb3Mgb3MgcG9udG9zCgpDb25zZWxobyBpbXBvcnRhbnRlOiBuw6NvIGjDoSB1bSBuw7ptZXJvIGFwZW5hcyBxdWUgcG9zc2EgbGhlIHJlc3BvbmRlciB0dWRvIHNvYnJlIGEgYXNzb2NpYcOnw6NvIGVudHJlIGFzIGR1YXMgdmFyacOhdmVpcy4gQXMgbWVkaWRhcyBkZSBjb3JyZWxhw6fDo28gcXVlIHZvY8OqIHVzYXLDoSBzZXJ2aXLDo28gbWFpcyBwYXJhIGNvbXBsZW1lbnRhciBlIHF1YW50aWZpY2FyIG9ic2VydmHDp8O1ZXMgZmVpdGFzIGVtIGdyw6FmaWNvcyBkZSBkaXNwZXJzw6NvIGRvIHF1ZSBwYXJhIGxoZSBndWlhci4gCgpVbSBleGVtcGxvIGNsYXNzaWNvIGNvbSBxdWF0cm8gcGFyZXMgZGUgdmFyacOhdmVpcyBjdWphcyBhc3NvY2lhw6fDtWVzIGludGVyZXNzYW06IHgxIGUgeTEsIHgyIGUgeTIsIGV0Yy4sIG8gW3F1YXJ0ZXRvIGRlIEFuc2NvbWJlXShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9BbnNjb21iZSUyN3NfcXVhcnRldCk6CgpgYGB7cn0Kc3VtbWFyeShhbnNjb21iZSkKCiMgdW0gcG91Y28gZGUgcmVhcnJ1bWHDp8OjbyBwcmltZWlyby4gCiMgKHBhcmEgZW50ZW5kZXIgbWVsaG9yLCB2b2PDqiBwb2RlIHByb2N1cmFyIHNvYnJlIGRhZG9zIGVtIGZvcm1hdG8KIyB3aWRlIG91IGxvbmcgbm8gUikKdDEgPC0gbWVsdChzZWxlY3QoYW5zY29tYmUsIDE6NCksIGlkID0gYygpKQp0MiA8LSBtZWx0KHNlbGVjdChhbnNjb21iZSwgNTo4KSwgaWQgPSBjKCkpCm5hbWVzKHQyKSA8LSBjKCJ2MiIsICJ2YWx1ZTIiKQpkYWRvcyA8LSBkYXRhLmZyYW1lKHZhcnMgPSBwYXN0ZSh0MSR2YXJpYWJsZSwgImUiLCB0MiR2MiksIAogICAgICAgICAgICAgICB2MSA9IHQxJHZhbHVlLCAKICAgICAgICAgICAgICAgdjIgPSB0MiR2YWx1ZTIpCmBgYAoKQ2FsY3VsYW5kbyBhIGNvcnJlbGHDp8OjbyBsaW5lYXI6IAoKYGBge3J9CmRhZG9zICU+JSAKICBncm91cF9ieSh2YXJzKSAlPiUgCiAgc3VtbWFyaXNlKGNvcnJlbGFjYW8gPSBjb3IodjEsIHYyLCBtZXRob2QgPSAicGVhcnNvbiIpKQpgYGAKCkUgc2Ugb2xoYXJtb3Mgb3MgZGFkb3M/CgpgYGB7cn0KZ2dwbG90KGRhZG9zLCBhZXModjEsIHYyKSkgKyAKICBnZW9tX3BvaW50KGNvbG9yID0gImRhcmtvcmFuZ2UiLCBzaXplID0gNCwgYWxwaGEgPSAwLjcpICsgCiAgdGhlbWVfYncoKSArIAogIHNjYWxlX3hfY29udGludW91cyhicmVha3MgPSBzZXEoMCwgMjAsIDIpKSArIAogIHNjYWxlX3lfY29udGludW91cyhicmVha3MgPSBzZXEoMCwgMTIsIDIpKSArIAogIGV4cGFuZF9saW1pdHMoeCA9IDAsIHkgPSAwKSArIAogIGZhY2V0X3dyYXAofiB2YXJzKQpgYGAKClF1YXRybyByZWxhw6fDtWVzIGRpZmVyZW50ZXMsIG1lc21hIHF1YW50aWZpY2HDp8Ojby4gUGFyYSBvIHNlZ3VuZG8gZ3J1cG8sIG7Do28gaMOhIHVtYSByZWxhw6fDo28gbGluZWFyLiBObyAzbywgaMOhIHVtYSByZWxhw6fDo28gcGVyZmVpdGEgZW50cmUgYSBtYWlvcmlhIGRhcyBvYnNlcnZhw6fDtWVzLCBjb20gdW1hIGV4Y2XDp8Ojby4gTm8gNG8gZ3J1cG8gbsOjbyBow6EgcmVsYcOnw6NvOyBow6EgdW1hIGV4Y2XDp8OjbyBxdWUgZmF6IHBhcmVjZXIgcXVlIGjDoSB1bWEgcmVsYcOnw6NvLgoKTyBxdWUgb3Mgb3V0cm9zIGNvZWZpY2llbnRlcyBwb2RlbSBub3MgZGl6ZXI/IAoKYGBge3J9CmRhZG9zICU+JSAKICBncm91cF9ieSh2YXJzKSAlPiUgCiAgc3VtbWFyaXNlKHBlYXJzb24gPSBjb3IodjEsIHYyLCBtZXRob2QgPSAicGVhcnNvbiIpLCAKICAgICAgICAgICAgc3BlYXJtYW4gPSBjb3IodjEsIHYyLCBtZXRob2QgPSAic3BlYXJtYW4iKSwKICAgICAgICAgICAga2VuZGFsbCA9IGNvcih2MSwgdjIsIG1ldGhvZCA9ICJrZW5kYWxsIikpCmBgYAoKLS0tLS0tLS0tLS0tLQoKIyNGb3LDp2EgZSB0YW1iw6ltIGRpcmXDp8OjbwoKRGl0byBpc3NvLCBlc3NhIGZpZ3VyYSBhanVkYSBhIGxlbWJyYXIgYSByZWxhw6fDo28gZW50cmUgbyB2YWxvciBlc3BlcmFkbyBkbyBjb2VmaWNpZW50ZSAobGluZWFyKSBlIHbDoXJpb3MgdGlwb3MgZGUgYXNzb2NpYcOnw6NvIGVudHJlIGR1YXMgdmFyacOhdmVpczoKCiFbZGEgd2lraXBlZGlhXShodHRwczovL3VwbG9hZC53aWtpbWVkaWEub3JnL3dpa2lwZWRpYS9jb21tb25zL2QvZDQvQ29ycmVsYXRpb25fZXhhbXBsZXMyLnN2ZykKCgotLS0tLS0tLS0tLS0tCgojIEFsZ3VucyB0ZXN0ZXMgZ2VyYW5kbyBhIHJlbGHDp8OjbwoKUmVsYcOnw6NvIGxpbmVhcjoKCmBgYHtyfQpzZXQuc2VlZCgxMjMpCnggPC0gcm5vcm0oMTAwKSAqIDEwMAp0YW1hbmhvX2RvX2Vycm8gPC0gNTAKeSA8LSAwLjUgKiB4ICsgcm5vcm0oMTAwKSAqIHRhbWFuaG9fZG9fZXJybyArIDIwCgpkZiA8LSBkYXRhLmZyYW1lKHggPSB4LCAKICAgICAgICAgICAgICAgICB5ID0geSkKCmdncGxvdChkZiwgYWVzKHggPSB4LCB5ID0geSkpICsgCiAgZ2VvbV9wb2ludChjb2xvdXIgPSAiZGFya2dyZXkiLCBzaXplID0gNCkgKyAKICB0aGVtZV9idygpCgpjb3IoZGYkeCwgZGYkeSwgbWV0aG9kID0gInBlYXJzb24iKQoKIyBEb2JybyBkZSBlcnJvIGVtIHRvcm5vIGRlIHVtYSBmdW7Dp8OjbyBsaW5lYXIgZih4KQpkZiR5IDwtIDAuNSAqIHggKyBybm9ybSgxMDApICogdGFtYW5ob19kb19lcnJvICogMiArIDIwCgpnZ3Bsb3QoZGYsIGFlcyh4ID0geCwgeSA9IHkpKSArIAogIGdlb21fcG9pbnQoY29sb3VyID0gImRhcmtncmV5Iiwgc2l6ZSA9IDQpICsgCiAgdGhlbWVfYncoKQpjb3IoZGYkeCwgZGYkeSwgbWV0aG9kID0gInBlYXJzb24iKQpgYGAKClJlbGHDp8OjbyBuw6NvIGxpbmVhcgoKYGBge3J9Cm4gPSAxMDAKZGYgPC0gZGF0YS5mcmFtZSh4ID0gcnVuaWYobiwgbWluPSAxLCBtYXggPSAyMCkpCmRmJHkgPSAxMDAgKiBleHAoLTEuMiAqIGRmJHgpICMrIHJub3JtKG4sIG1lYW4gPSAwLjA1LCBzZCA9IDEpCgpnZ3Bsb3QoZGYsIGFlcyh4ID0geCwgeSA9IHkpKSArIAogIGdlb21fcG9pbnQoY29sb3VyID0gImRhcmtncmV5Iiwgc2l6ZSA9IDQpIAoKZ2dwbG90KGRmLCBhZXMoeCA9IHgsIHkgPSB5KSkgKyAKICBnZW9tX3BvaW50KGNvbG91ciA9ICJkYXJrZ3JleSIsIHNpemUgPSA0KSAjICsgc2NhbGVfeV9sb2cxMCgpICAKCiMgdGVudGUgZGVzY29tZW50YW5kbyBvIHNjYWxlX3lfbG9nMTAgYWNpbWEgZSB2ZWphIG8gcXVlIGFjb250ZWNlCiAgCmNvcihkZiR4LCBkZiR5KQpjb3IoZGYkeCwgZGYkeSwgbWV0aG9kID0gInNwZWFybWFuIikKY29yKGRmJHgsIGRmJHksIG1ldGhvZCA9ICJrZW5kYWxsIikKYGBgCgoKLS0tLS0tCgojIyBEaWNhIHByw6F0aWNhCgpRdWFuZG8gY29uc2lkZXJhbmRvIHbDoXJpYXMgdmFyacOhdmVpcywgbyBtYWlzIHByw6F0aWNvIMOpIHVzYXIgdW1hIG1hdHJpeiBkZSBncsOhZmljb3MgZGUgZGlzcGVyc8OjbzogCgpgYGB7cn0KZGFkb3MgPC0gcmVhZC5jc3YoImRhZG9zLy9EYWRvcyBkZSBhbHVub3MgcGFyYSBhcyBhdWxhcyBkZSBGUENDLXJlcG9ydC5jc3YiKQpkYWRvcyA8LSBkYWRvcyAlPiUgc2VsZWN0KDIsIDMsIDUsIDcsIDkpICU+JSBmaWx0ZXIoY29tcGxldGUuY2FzZXMoZGFkb3MpKQpuYW1lcyhkYWRvcykgPC0gYygiY3Vyc28iLCAic2V4byIsICJyZXBvc2l0b3Jpb3MiLCAiY29uZmlhbmNhLmVtLmVzdGF0aXN0aWNhIiwgCiAgICAgICAgICAgICAgICAgICJhbHR1cmEiKQoKc3RyKGRhZG9zKQoKZ2dwYWlycyhkYWRvcykKYGBgCgotLS0tLS0KCmBgYHtyfQpuYW1lcyhkaWFtb25kcykKCmdncGxvdChkaWFtb25kcywgYWVzKHggPSBjYXJhdCwgeSA9IHByaWNlKSkgKyAKICBnZW9tX2hleCgpCgpnZ3Bsb3QoZGlhbW9uZHMsIGFlcyh4ID0gcHJpY2UpKSArIAogIGdlb21faGlzdG9ncmFtKCkgKyAKICBzY2FsZV94X2xvZzEwKCkKCmdncGFpcnMoZGFkb3MpCmBgYAoK